A continuation of the PCA analysis of the wine dataset: k-means clustering and hierarchical clustering
PCA is an exploratory data analysis tool and may also be used for feature engineering and/or clustering. This post continues the clustering analysis of the wines dataset from the kohonen package: I carry out k-means clustering using the tidymodels framework, as well as hierarchical clustering using the factoextra package.
Use only continuous variables and ensure that there are no missing data. Determine the number of components using eigenvalues, scree plots and parallel analysis.
The loadings show the linear combinations of the original variables that make up each new dimension.
The scores show the coordinates of the individual wine samples in the new low-dimensional space.
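Of these, parallel analysis is not shown in this post. As a minimal sketch (assuming the psych package, which is not otherwise used here), it could be run as follows:
library(psych)
data(wines, package = "kohonen")
# compare the observed eigenvalues against eigenvalues from resampled data;
# fa = "pc" restricts the analysis to principal components
fa.parallel(scale(wines), fa = "pc")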
This dataset is from the kohonen package. It contains 177 rows and 13 columns.
These data are the results of chemical analyses of wines grown in the same region in Italy (Piedmont) but derived from three different cultivars: Nebbiolo, Barbera and Grignolino grapes. The wine from the Nebbiolo grape is called Barolo. The data contain the quantities of several constituents found in each of the three types of wines, as well as some spectroscopic variables.
PCA was performed in an earlier post; k-means clustering and hierarchical clustering analysis (HCA) will be built upon the PCA scores.
library(tidyverse)   # dplyr, ggplot2, tidyr, purrr
library(tidymodels)  # recipes, broom
library(gridExtra)   # grid.arrange()
library(factoextra)  # fviz_nbclust(), hcut()
library(kohonen)     # wines dataset (data(wines) also loads vintages)
data(wines)
wines <- as.data.frame(wines) %>% # clean_names() requires a data.frame
  janitor::clean_names() %>%
  as_tibble()
glimpse(wines) # does not contain the type of wine (Y variable)
Rows: 177
Columns: 13
$ alcohol <dbl> 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, …
$ malic_acid <dbl> 1.78, 2.36, 1.95, 2.59, 1.76, 1.87, 2.15, …
$ ash <dbl> 2.14, 2.67, 2.50, 2.87, 2.45, 2.45, 2.61, …
$ ash_alkalinity <dbl> 11.2, 18.6, 16.8, 21.0, 15.2, 14.6, 17.6, …
$ magnesium <dbl> 100, 101, 113, 118, 112, 96, 121, 97, 98, …
$ tot_phenols <dbl> 2.65, 2.80, 3.85, 2.80, 3.27, 2.50, 2.60, …
$ flavonoids <dbl> 2.76, 3.24, 3.49, 2.69, 3.39, 2.52, 2.51, …
$ non_flav_phenols <dbl> 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, …
$ proanth <dbl> 1.28, 2.81, 2.18, 1.82, 1.97, 1.98, 1.25, …
$ col_int <dbl> 4.38, 5.68, 7.80, 4.32, 6.75, 5.25, 5.05, …
$ col_hue <dbl> 1.05, 1.03, 0.86, 1.04, 1.05, 1.02, 1.06, …
$ od_ratio <dbl> 3.40, 3.17, 3.45, 2.93, 2.85, 3.58, 3.58, …
$ proline <dbl> 1050, 1185, 1480, 735, 1450, 1290, 1295, 1…
Refer to the earlier post on PCA of the wines dataset for the full PCA workflow.
The dataset does not include the y variable (type of wine), so the update_role() step is omitted.
step_normalize() combines step_center() and step_scale()
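For reference, a minimal sketch of the equivalent two-step specification:
# equivalent to step_normalize(all_predictors())
recipe(~ ., data = wines) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors())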
Note that step_pca() is the second step and is tagged with id = "pca", which will be used to retrieve the PCA results later.
wines_recipe <- recipe(~ ., data = wines) %>%
# update_role(vintages, new_role = "id") %>% # skipped
# step_naomit(all_predictors()) %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors(), id = "pca")
wines_recipe # 13 predictors
Data Recipe
Inputs:
role #variables
predictor 13
Operations:
Centering and scaling for all_predictors()
No PCA components were extracted.
wines_prep <- prep(wines_recipe)
wines_prep # trained
Data Recipe
Inputs:
role #variables
predictor 13
Training data contained 177 data points and no missing data.
Operations:
Centering and scaling for alcohol, malic_acid, ... [trained]
PCA extraction with alcohol, malic_acid, ... [trained]
tidy_pca_loadings <- wines_prep %>%
tidy(id = "pca")
tidy_pca_loadings # values here are the loadings
# A tibble: 169 x 4
terms value component id
<chr> <dbl> <chr> <chr>
1 alcohol -0.138 PC1 pca
2 malic_acid 0.246 PC1 pca
3 ash 0.00432 PC1 pca
4 ash_alkalinity 0.237 PC1 pca
5 magnesium -0.135 PC1 pca
6 tot_phenols -0.396 PC1 pca
7 flavonoids -0.424 PC1 pca
8 non_flav_phenols 0.299 PC1 pca
9 proanth -0.313 PC1 pca
10 col_int 0.0933 PC1 pca
# … with 159 more rows
wines_bake <- bake(wines_prep, wines)
wines_bake # contains the PCA scores used for HCA and k-means clustering
# A tibble: 177 x 5
PC1 PC2 PC3 PC4 PC5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -2.22 -0.301 -2.03 0.281 -0.259
2 -2.52 1.06 0.974 -0.734 -0.198
3 -3.74 2.80 -0.180 -0.575 -0.257
4 -1.02 0.886 2.02 0.432 0.274
5 -3.04 2.16 -0.637 0.486 -0.630
6 -2.45 1.20 -0.985 0.00466 -1.03
7 -2.06 1.64 0.143 1.20 0.0105
8 -2.51 0.958 -1.78 -0.104 -0.871
9 -2.76 0.822 -0.986 -0.374 -0.437
10 -3.48 1.35 -0.428 -0.0399 -0.316
# … with 167 more rows
Only the scree plot is shown below. Refer to the PCA analysis of wine post for other options for determining the number of PCs.
# b. Scree plot/Variance plot
wines_prep %>%
tidy(id = "pca", type = "variance") %>%
filter(terms == "percent variance") %>%
ggplot(aes(x = component, y = value)) +
geom_point(size = 2) +
geom_line(size = 1) +
scale_x_continuous(breaks = 1:13) +
labs(title = "% Variance explained",
y = "% total variance",
x = "PC",
caption = "Source: Wines dataset from kohonen package") +
theme_classic() +
theme(axis.title = element_text(face = "bold", size = 12),
axis.text = element_text(size = 10),
plot.title = element_text(size = 14, face = "bold")) # 2 or 3
plot_loadings <- tidy_pca_loadings %>%
filter(component %in% c("PC1", "PC2", "PC3")) %>%
mutate(terms = tidytext::reorder_within(terms,
abs(value),
component)) %>%
ggplot(aes(abs(value), terms, fill = value > 0)) +
geom_col() +
facet_wrap(~ component, scales = "free_y") +
tidytext::scale_y_reordered() + # strips the "___facet" suffix that reorder_within() appends
scale_fill_manual(values = c("deepskyblue4", "darkorange")) +
labs( x = "absolute value of contribution",
y = NULL,
fill = "Positive?",
title = "PCA Loadings Plot",
subtitle = "Number of PC should be 3, compare the pos and the neg",
caption = "Source: ChemometricswithR") +
theme_minimal() +
theme(title = element_text(size = 24, face = "bold"),
axis.text = element_text(size = 16),
axis.title = element_text(size = 20))
plot_loadings
# PC1: flavonoids, tot_phenols, od_ratio, proanthocyanidins, col_hue; 36%
# PC2: col_int, alcohol, proline, ash, magnesium; 19.2%
# PC3: ash, ash_alkalinity, non_flav_phenols; 11.2%
# define arrow style
arrow_style <- arrow(angle = 30,
length = unit(0.2, "inches"),
type = "closed")
# get pca loadings into wider format
pca_loadings_wider <- tidy_pca_loadings %>%
  pivot_wider(names_from = component, values_from = value, id_cols = terms)
pca_loadings_only <- pca_loadings_wider %>%
ggplot(aes(x = PC1, y = PC2)) +
geom_segment(aes(xend = PC1, yend = PC2),
x = 0,
y = 0,
arrow = arrow_style) +
ggrepel::geom_text_repel(aes(x = PC1, y = PC2, label = terms),
hjust = 0,
vjust = 1,
size = 5,
color = "red") +
labs(title = "Loadings on PCs 1 and 2 for normalized data") +
theme_classic()
# Scores plot #####
# PCA SCORES are in bake
pc1pc2_scores_plot <- wines_bake %>%
  mutate(vintages = vintages) %>% # wine types; vintages is loaded alongside wines by data(wines)
  ggplot(aes(PC1, PC2)) +
  geom_point(aes(color = vintages, shape = vintages),
             alpha = 0.8, size = 2) +
scale_color_manual(values = c("deepskyblue4", "darkorange", "purple")) +
labs(title = "Scores on PCs 1 and 2 for normalized data",
x = "PC1 (36%)",
y = "PC2 (19.2%)") +
theme_classic() +
theme(legend.position = "none")
grid.arrange(pc1pc2_scores_plot, pca_loadings_only, ncol = 2)
The PCA scores from the bake step above will be used for the clustering analysis.
There are three common ways of determining the number of clusters: the gap statistic, the elbow method (total within-cluster sum of squares) and the average silhouette method. Let us look at all three of them.
gap_statistic <- cluster::clusGap(wines_bake,
                                  FUNcluster = kmeans,
                                  nstart = 50,
                                  K.max = 10, # max number of clusters
                                  B = 1000)   # number of bootstrap samples
factoextra::fviz_gap_stat(gap_statistic) # theoretically should have only 3 clusters
fviz_nbclust(wines_bake,
kmeans,
method = "wss") # this suggests 3 clusters, in line with theory
fviz_nbclust(wines_bake,
             FUNcluster = hcut,
             method = "silhouette") # this suggests 3 clusters
All three methods agree that there should be 3 clusters. This may not always be the case. In any case, we know that there are 3 different types of wine in the dataset.
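Since the wine types are known, the k = 3 solution can be sanity-checked directly. Below is a minimal sketch (my addition, not from the original analysis) that fits k-means with k = 3 on the PCA scores and cross-tabulates the cluster assignments against vintages, the wine-type factor that data(wines) loads alongside wines:
set.seed(2021) # k-means depends on random starts
km_final <- kmeans(wines_bake, centers = 3, nstart = 50)
table(vintages, km_final$cluster) # rows: wine types; columns: cluster labels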
# exploring different k numbers #####
kclusts_explore <- tibble(k = 1:10) %>%
mutate(kclust = purrr::map(k, ~kmeans(wines_bake, .x)),
tidied = purrr::map(kclust, tidy),
glanced = purrr::map(kclust, glance),
augmented = purrr::map(kclust, augment, wines_bake))
kclusts_explore
# A tibble: 10 x 5
k kclust tidied glanced augmented
<int> <list> <list> <list> <list>
1 1 <kmeans> <tibble [1 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
2 2 <kmeans> <tibble [2 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
3 3 <kmeans> <tibble [3 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
4 4 <kmeans> <tibble [4 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
5 5 <kmeans> <tibble [5 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
6 6 <kmeans> <tibble [6 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
7 7 <kmeans> <tibble [7 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
8 8 <kmeans> <tibble [8 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
9 9 <kmeans> <tibble [9 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
10 10 <kmeans> <tibble [10 × 8]> <tibble [1 × 4]> <tibble [177 × 6…
# turn this into 3 separate datasets, each representing a
# different type of data
#
clusters <- kclusts_explore %>%
unnest(cols = c(tidied))
clusters
# A tibble: 55 x 12
k kclust PC1 PC2 PC3 PC4 PC5
<int> <list> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 <kmea… -1.03e-15 -2.78e-16 -3.22e-16 -5.39e-17 1.76e-16
2 2 <kmea… 1.91e+ 0 -8.65e- 2 -1.73e- 3 7.28e- 2 -1.41e- 2
3 2 <kmea… -1.89e+ 0 8.56e- 2 1.71e- 3 -7.20e- 2 1.39e- 2
4 3 <kmea… 2.71e+ 0 1.10e+ 0 -2.35e- 1 -6.17e- 2 7.64e- 2
5 3 <kmea… 4.43e- 4 -1.76e+ 0 1.85e- 1 -7.36e- 2 7.54e- 2
6 3 <kmea… -2.26e+ 0 9.55e- 1 -5.83e- 4 1.30e- 1 -1.44e- 1
7 4 <kmea… 2.90e- 1 -1.77e+ 0 -8.54e- 1 4.55e- 1 1.35e- 1
8 4 <kmea… 2.76e+ 0 1.21e+ 0 -1.52e- 1 -1.11e- 1 5.10e- 2
9 4 <kmea… -2.39e+ 0 1.04e+ 0 -2.57e- 1 1.29e- 1 -2.19e- 1
10 4 <kmea… -3.41e- 1 -1.30e+ 0 1.28e+ 0 -4.39e- 1 1.17e- 1
# … with 45 more rows, and 5 more variables: size <int>,
# withinss <dbl>, cluster <fct>, glanced <list>, augmented <list>
#
assignments <- kclusts_explore %>%
unnest(cols = c(augmented))
assignments # can be used to plot, with each point colored according to predicted cluster
# A tibble: 1,770 x 10
k kclust tidied glanced PC1 PC2 PC3 PC4 PC5
<int> <list> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 <kmea… <tibb… <tibbl… -2.22 -0.301 -2.03 0.281 -0.259
2 1 <kmea… <tibb… <tibbl… -2.52 1.06 0.974 -0.734 -0.198
3 1 <kmea… <tibb… <tibbl… -3.74 2.80 -0.180 -0.575 -0.257
4 1 <kmea… <tibb… <tibbl… -1.02 0.886 2.02 0.432 0.274
5 1 <kmea… <tibb… <tibbl… -3.04 2.16 -0.637 0.486 -0.630
6 1 <kmea… <tibb… <tibbl… -2.45 1.20 -0.985 0.00466 -1.03
7 1 <kmea… <tibb… <tibbl… -2.06 1.64 0.143 1.20 0.0105
8 1 <kmea… <tibb… <tibbl… -2.51 0.958 -1.78 -0.104 -0.871
9 1 <kmea… <tibb… <tibbl… -2.76 0.822 -0.986 -0.374 -0.437
10 1 <kmea… <tibb… <tibbl… -3.48 1.35 -0.428 -0.0399 -0.316
# … with 1,760 more rows, and 1 more variable: .cluster <fct>
#
clusterings <- kclusts_explore %>%
unnest(cols = c(glanced))
clusterings
# A tibble: 10 x 8
k kclust tidied totss tot.withinss betweenss iter augmented
<int> <list> <list> <dbl> <dbl> <dbl> <int> <list>
1 1 <kmean… <tibbl… 1834. 1834. -2.50e-12 1 <tibble […
2 2 <kmean… <tibbl… 1834. 1192. 6.42e+ 2 1 <tibble […
3 3 <kmean… <tibbl… 1834. 820. 1.01e+ 3 2 <tibble […
4 4 <kmean… <tibbl… 1834. 730. 1.10e+ 3 3 <tibble […
5 5 <kmean… <tibbl… 1834. 659. 1.18e+ 3 3 <tibble […
6 6 <kmean… <tibbl… 1834. 603. 1.23e+ 3 4 <tibble […
7 7 <kmean… <tibbl… 1834. 562. 1.27e+ 3 3 <tibble […
8 8 <kmean… <tibbl… 1834. 545. 1.29e+ 3 5 <tibble […
9 9 <kmean… <tibbl… 1834. 471. 1.36e+ 3 3 <tibble […
10 10 <kmean… <tibbl… 1834. 465. 1.37e+ 3 4 <tibble […
# visualize
# number of clusters
clusterings %>% # from glance
ggplot(aes(k, tot.withinss)) + # total within cluster sum of squares, keep low
geom_line() +
geom_point() +
scale_x_continuous(breaks = 1:10) +
labs(title = "Plot of Total Within Sum of Squares for different number of clusters",
subtitle = "Additional clusters beyond k = 3 have little value") +
theme_classic()
# how datapoints are separated
glimpse(assignments)
Rows: 1,770
Columns: 10
$ k <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ kclust <list> [<1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ tidied <list> [<tbl_df[1 x 8]>, <tbl_df[1 x 8]>, <tbl_df[1 x 8]…
$ glanced <list> [<tbl_df[1 x 4]>, <tbl_df[1 x 4]>, <tbl_df[1 x 4]…
$ PC1 <dbl> -2.223934, -2.524760, -3.744056, -1.017245, -3.040…
$ PC2 <dbl> -0.30145757, 1.05925179, 2.79737289, 0.88586726, 2…
$ PC3 <dbl> -2.0271695, 0.9739613, -0.1798599, 2.0181445, -0.6…
$ PC4 <dbl> 0.281108579, -0.733645703, -0.575492236, 0.4315681…
$ PC5 <dbl> -0.25880549, -0.19804048, -0.25714173, 0.27445613,…
$ .cluster <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
assignments %>% # from augment
ggplot(aes(x = PC1, y = PC2)) + # use PCA data
geom_point(aes(color = .cluster), size = 2, alpha = 0.8) +
facet_wrap( ~ k) +
# to see the center of the clusters
geom_point(data = clusters, size = 9, shape = "x") +
labs(x = "PC1 (36% variance)",
y = "PC2 (19.2% variance)",
title = "Visualization of k-means clustering",
subtitle = "Optimal k = 3",
caption = "Source: Wines dataset from kohonen package") +
theme_minimal()
Initially, I was stuck at the visualization part for k-means clustering, as I didn't know how to bring in my x- and y-axis data. I had been using the original dataset all along, and was wondering how plots created using factoextra::fviz_cluster() could report Dim1 on the x-axis and Dim2 on the y-axis. I finally had the eureka moment when I realised I should use the PCA scores from the bake step earlier.
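For reference, a minimal sketch of that factoextra shortcut (my addition; km3 is a hypothetical object name):
# fviz_cluster() reduces the data to two dimensions for plotting,
# labelling the axes Dim1 and Dim2
km3 <- kmeans(wines_bake, centers = 3, nstart = 50)
factoextra::fviz_cluster(km3, data = wines_bake)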
I really like how the tidymodels workflow allows for visualizing how the clusters separate when different values of k are used. The functions augment(), tidy() and glance() were very efficient in reporting the results of k-means clustering. Previously I had only used tidy() and glance() for regression, and I didn't know they could be extended to cluster analysis as well.
Lastly, I find dendrograms very aesthetically intuitive, and I like how the colors and types of dendrograms can be customised (a sketch follows below). However, the assumption is that there is some structure in the data in the first place; otherwise HCA would give very misleading results.
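As a minimal sketch of such a dendrogram on the PCA scores (my addition, with illustrative parameter choices), using hcut() and fviz_dend() from factoextra:
# hierarchical clustering on the PCA scores, cut into 3 clusters (Ward linkage)
wines_hcut <- factoextra::hcut(wines_bake, k = 3, hc_method = "ward.D2")
factoextra::fviz_dend(wines_hcut,
                      rect = TRUE, # draw rectangles around the clusters
                      k_colors = c("deepskyblue4", "darkorange", "purple"))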